Models that accurately predict properties based on chemical structure are valuable tools in drug discovery. However, for many properties, public and private training sets are typically small, and it is difficult for the models to generalize well outside of the training data. Recently, large language models have addressed this problem by using self-supervised pretraining on large unlabeled datasets, followed by fine-tuning on smaller, labeled datasets. In this paper, we report MolE, a molecular foundation model that adapts the DeBERTa architecture to molecular graphs, together with a two-step pretraining strategy. The first pretraining step is a self-supervised approach focused on learning chemical structures, and the second step is a massive multi-task approach to learn biological information. We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutics Data Commons.
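The paper's weights and exact API are not reproduced here, but the pretrain-then-fine-tune recipe the abstract describes can be sketched generically. In the minimal PyTorch sketch below, `PretrainedMolEncoder`, `ADMETRegressor`, and all dimensions are hypothetical stand-ins for a MolE-style encoder; only the fine-tuning step on a small labeled ADMET-style regression set is shown.

```python
# Minimal fine-tuning sketch. `PretrainedMolEncoder` is a stand-in for a
# pretrained molecular transformer (e.g., a MolE-style DeBERTa-on-graphs
# encoder); the real model, tokenization, and checkpoints are assumptions.
import torch
import torch.nn as nn

class PretrainedMolEncoder(nn.Module):
    """Placeholder: maps a batch of molecule encodings to an embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.body(x)

class ADMETRegressor(nn.Module):
    def __init__(self, encoder, dim=256):
        super().__init__()
        self.encoder = encoder            # would be initialized from pretraining
        self.head = nn.Linear(dim, 1)     # new task-specific head
    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

encoder = PretrainedMolEncoder()          # pretrained weights would be loaded here
model = ADMETRegressor(encoder)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for fine-tuning
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 128), torch.randn(32)          # toy labeled batch
for _ in range(10):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```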
Distribution shifts, where the training distribution differs from the test distribution, can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
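The open-source package mentioned at the end follows a simple load-and-iterate pattern. The sketch below is based on the publicly documented `wilds` interface (argument names may differ across package versions) and uses the Camelyon17 tumor-identification dataset, one of the cross-hospital shifts named above, as an example.

```python
# Sketch of the documented `wilds` package workflow; call signatures follow
# the public docs but may vary by version.
import torchvision.transforms as T
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

dataset = get_dataset(dataset="camelyon17", download=True)   # shift across hospitals
train_data = dataset.get_subset("train", transform=T.ToTensor())
train_loader = get_train_loader("standard", train_data, batch_size=16)

for x, y_true, metadata in train_loader:
    # x: images, y_true: labels, metadata: e.g., which hospital each image came from
    pass  # model training would go here

# Evaluation uses the dataset's own metric and reports both in-distribution
# and out-of-distribution splits, which is where the gap described above appears.
```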
To foster the use of machine learning in 5G networks, the International Telecommunication Union (ITU) hosted in 2021 the second edition of the ITU AI/ML in 5G Challenge, with more than 1600 participants from 82 countries. This work details the second-place solution overall, which is also the winning solution of the Graph Neural Networking Challenge 2021. We address the problem of generalization when applying a model to 5G networks, which may have longer paths and larger link capacities than those seen in training. To achieve this, we propose to first extract robust features related to queueing theory (QT), and then fine-tune the analytical baseline prediction with a modification of the RouteNet graph neural network (GNN) model. The proposed solution generalizes much better than simply using RouteNet, and manages to reduce the mean absolute percentage error of the analytical baseline from 10.42 to 1.45 (1.27 with an ensemble). This suggests that making small changes to an approximation model that is known to be robust can be an effective way to improve accuracy without compromising generalization.
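The core idea, predicting delay with a queueing-theory baseline and learning only a correction on top of it, can be illustrated without the full RouteNet model. The sketch below is a toy illustration under stated assumptions, not the authors' code: it computes an M/M/1-style analytical delay as the robust QT feature and fits a small regressor to a multiplicative correction factor.

```python
# Toy illustration of "analytical baseline + learned correction".
# The M/M/1 delay W = 1/(mu - lam) serves as the queueing-theory feature;
# a least-squares model learns a correction factor. A sketch, not RouteNet.
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(1.0, 8.0, size=1000)         # arrival rates
mu = lam + rng.uniform(1.0, 4.0, size=1000)    # service rates (stable queues)
w_qt = 1.0 / (mu - lam)                         # analytical baseline delay
w_true = w_qt * (1.0 + 0.1 * np.sin(lam))       # synthetic "real" delay

# Fit a correction factor w_true / w_qt from simple features.
X = np.stack([np.ones_like(lam), lam, mu], axis=1)
coef, *_ = np.linalg.lstsq(X, w_true / w_qt, rcond=None)
w_pred = w_qt * (X @ coef)                      # corrected baseline prediction

mape = 100 * np.mean(np.abs(w_pred - w_true) / w_true)
print(f"MAPE of corrected baseline: {mape:.2f}%")
```

Because the learned part only rescales a physically grounded baseline, predictions degrade gracefully on inputs outside the training range, which is the generalization property the solution exploits.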
In robotics, Visual Place Recognition is a continuous process that takes as input a video stream and produces a hypothesis of the robot's current position within a map of known places. This task requires robust, scalable, and efficient techniques for real applications. This work proposes a detailed taxonomy of techniques using sequential descriptors, highlighting different mechanisms for fusing the information from individual images. This classification is supported by a complete benchmark of experimental results that provides evidence of the strengths and weaknesses of these different architectural choices. In comparison to existing sequential-descriptor methods, we further investigate the viability of Transformer backbones instead of CNN ones, and we propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms the previous state of the art across different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.
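SeqVLAD's key move is to aggregate local descriptors from all frames of a sequence into a single NetVLAD-style descriptor, rather than describing each image separately and fusing afterwards. The sketch below is a schematic PyTorch version of that idea; the cluster count, dimensions, and normalization choices are assumptions, not the authors' implementation (see their repository for the real code).

```python
# Schematic sequence-level VLAD aggregation (not the official SeqVLAD code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqVLADSketch(nn.Module):
    def __init__(self, dim=256, n_clusters=64):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
        self.assign = nn.Linear(dim, n_clusters)  # soft-assignment logits

    def forward(self, feats):
        # feats: (seq_len, n_local, dim) local descriptors from every frame.
        x = feats.reshape(-1, feats.shape[-1])           # pool all frames together
        a = F.softmax(self.assign(x), dim=-1)            # (N, K) soft assignments
        residuals = x.unsqueeze(1) - self.centroids      # (N, K, dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(0)      # (K, dim) aggregated residuals
        vlad = F.normalize(vlad, dim=-1)                 # intra-normalization
        return F.normalize(vlad.flatten(), dim=0)        # one descriptor per sequence

frames = torch.randn(5, 100, 256)   # 5 frames, 100 local descriptors each
descriptor = SeqVLADSketch()(frames)
print(descriptor.shape)             # torch.Size([16384])
```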
Semi-supervised learning has drawn the attention of researchers because it allows the structure of unlabeled data to be exploited, achieving classification results competitive with supervised methods while using far fewer labels. The Local and Global Consistency (LGC) algorithm is one of the best-known graph-based semi-supervised (GSSL) classifiers. Notably, its solution can be written as a linear combination of the known labels. The coefficients of this linear combination depend on a parameter $\alpha$, which determines the decay over time as labeled vertices are reached in a random walk. In this work, we discuss how removing the self-influence of labeled instances may be beneficial, and how this relates to the leave-one-out error. Moreover, we propose minimizing this leave-one-out loss with automatic differentiation. Within this framework, we propose methods to estimate label reliability and the diffusion rate. Optimizing the diffusion rate is done more efficiently in a spectral representation. Results show that the label-reliability approach competes with robust L1-norm methods, that removing diagonal entries reduces the risk of overfitting, and that it leads to suitable criteria for parameter selection.
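The closed form behind this discussion is the standard LGC solution $F^* = (1-\alpha)(I - \alpha S)^{-1} Y$ with $S = D^{-1/2} W D^{-1/2}$; removing the self-influence of labeled instances corresponds to zeroing the diagonal of the propagation matrix, which is what makes a leave-one-out criterion cheap to evaluate. Below is a small numpy sketch of both steps; the graph and the value of $\alpha$ are illustrative.

```python
# LGC closed form and the "zero the diagonal" leave-one-out trick (sketch).
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 6, 0.9
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # symmetric affinity
d = W.sum(1)
S = W / np.sqrt(np.outer(d, d))                  # D^{-1/2} W D^{-1/2}

Y = np.zeros((n, 2)); Y[0, 0] = Y[1, 1] = 1      # two labeled vertices, two classes

P = (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * S)  # propagation matrix
F = P @ Y                                         # standard LGC solution F*

P_loo = P.copy(); np.fill_diagonal(P_loo, 0)      # remove self-influence
F_loo = P_loo @ Y                                 # each label predicted without itself
loo_correct = (F_loo[:2].argmax(1) == Y[:2].argmax(1)).mean()
print("leave-one-out accuracy on labeled points:", loo_correct)
```

Since every entry of `P_loo` is differentiable in $\alpha$, a leave-one-out loss built from `F_loo` can be minimized directly with automatic differentiation, which is the optimization the abstract proposes.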